Mel Schwan, Stuart Miller, Justin Howard, Paul Adams

Lab One: Visualization and Data Preprocessing

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It provides a structured approach to planning a data mining project and is a robust, well-proven methodology.

For Lab One we will use the first two phases of this methodology. We will start by focusing on understanding the objectives and requirements from a business perspective, then use this knowledge to define the data problem and project plan. Next we will identify data quality issues, discover initial insights, and detect interesting nuggets of information that might form a hypothesis for analysis.

In future labs we will execute all of the CRISP-DM steps.

(Figure: CRISP-DM process diagram)

1. Stage One - Determine Business Objectives and Assess the Situation

The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. We will define our objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of our project.

1.1 Assess the Current Situation (Q1A)

This analysis of Home Credit's Default Risk dataset will focus on generating accurate loan default risk probabilities. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. The target variable of the dataset is the binary label, 'TARGET', indicating whether the loan entered into default status or not.

1.1.1. Inventory of resources

List the resources available to the project including:

  • Personnel:
    • Mel Schwan
    • Stuart Miller
    • Justin Howard
    • Paul Adams
  • Data:
    • 307,511 individual loans
    • application_{train|test}.csv
    • bureau.csv
    • bureau_balance.csv
    • POS_CASH_balance.csv
    • credit_card_balance.csv
    • previous_application.csv
    • installments_payments.csv
    • HomeCredit_columns_description.csv
  • Computing resources:
    • Four personnel computers
  • Software:
    • Jupyter Notebook
    • Library Packages
      • Pandas
      • Pandas_profiling
      • Seaborn
      • Numpy
      • Matplotlib
      • Tables (custom package)
      • Cleaning (custom package)
      • Sklearn

1.1.2. Requirements

  • Requirements
    • Perform an analysis of a data set: explore statistical summaries of the features, visualize the attributes, and draw conclusions from the visualizations and analysis. Follow the CRISP-DM framework in the analysis (not all CRISP-DM stages are performed in this lab)
    • Demonstrate the team's understanding of Home Credit's business and of the data model that best predicts default on loans

1.1.3. Risks and contingencies

  • Imputation of data may result in false interpretation of the real data values
  • Elimination of features can remove key factors in predicting loan defaults

1.1.4. Terminology

  • Assets - Valuable resources owned by a business, which were acquired at a measurable money cost.
  • Bad debt - Debts that are not collectible and therefore, proves to be of no worth to the creditor.
  • Bridge Loan - Refers to short-term loan to fund temporary needs as long as permanent financing is not available.
  • Capitalization Rate - A rate of interest (discount) employed to calculate the present value of future cash flows. This is the rate that investors expect.
  • Credit Analysis - The process of determining whether an applicant satisfies the credit standards of a firm and the amount of credit that should be extended to him.
  • Debt Service Coverage Ratio - Ratio of total cash operating funds available to service debt obligations.
  • Default Risk - Uncertainty of expected returns from a security attributable to possible changes in the financial capacity of the issuer to make future payments to the security owner.
  • Delinquent - A situation where one fails to make payment, which can lead to foreclosure.
  • Line of Credit - A bank agreement stating a company may borrow at any time up to the specified limit.
  • Net Working Capital/Net Worth - The excess of current assets over current liabilities.
  • Negotiable Certificates of Deposit - A marketable receipt of funds deposited in a bank for a fixed period.
  • Off-Balance Sheet Financing - A hidden form of debt that is not shown as a liability.
  • Principal - An amount of money upon which interest is calculated.
  • Roll Over - A situation where parties to the contract agree to carry over the loan for another stated period at the time of maturity.
  • Security - Any thing offered or given to fulfill the performance of a contract. For example, real estate, stocks, fixed assets, jewelry, etc.
  • Term - The tenure of the policy.
  • Working capital - Current Assets minus Current Liabilities.
  • Write off - To consider as a loss or failure in case of uncollectible investments.

1.1.5.Costs and benefits (Q1B)

  • Cost

    • Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
  • Benefits

    • Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

1.2 What are the desired outputs of the project? (Q1C)

Business success criteria

  • Cost
  • Reduce default rates to 10 percent

Data mining success criteria

  • Build a data model that achieves an ROC AUC between 0.80 and 0.90
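As a concrete check of that criterion, scikit-learn's roc_auc_score can score a model's predicted probabilities; a minimal sketch with made-up labels and scores (illustrative only, not results from this project):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical true labels (1 = default) and predicted default probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc = roc_auc_score(y_true, y_score)
# the project's data mining success criterion
meets_criterion = 0.80 <= auc <= 0.90
```

A real evaluation would use held-out predictions from the trained model in place of the toy arrays.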

1.3 What Questions Are We Trying To Answer?

Use Home Credit's current customer data to predict whether a potential client is capable of repaying the requested loan. During this process we will determine the features that are most influential in predicting this target variable.

2. Stage Two - Data Understanding

The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.

2.1 Initial Data Report (Q2)

Initial data collection report - List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

In [1]:
# Import Libraries Required.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

# import custom code
from cleaning import read_clean_data, missing_values_table, load_bureau, create_newFeatures, merge_newFeatures 
from tables import count_values_table

# some defaults
pd_max_rows_default = 60
In [2]:
# load data 
# path =  './application_train.csv'
# note that 'XNA' is an encoding for NA; interpret as np.nan
df =  pd.read_csv('./application_train.csv',
                 na_values = ['XNA'])
#loading bureau dataset

bureau = pd.read_csv('./bureau.csv',
                    na_values = ['XNA'])
#bureau = load_bureau()
#newFeatures = create_newFeatures(bureau)
#df = df.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')

2.2 Describe Data (Q2A)

Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.

We will use two of the files from the total dataset.

  • application_train.csv: Information provided with each loan application
  • bureau.csv: Information regarding clients from the credit bureaus

The two data files can be joined on the loan id (SK_ID_CURR).
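A sketch of that join with pandas (toy IDs for illustration, not the real records):

```python
import pandas as pd

# miniature stand-ins for application_train.csv and bureau.csv
apps = pd.DataFrame({'SK_ID_CURR': [100002, 100003], 'TARGET': [1, 0]})
bureau = pd.DataFrame({'SK_ID_CURR': [100002, 100002],
                       'CREDIT_DAY_OVERDUE': [0, 5]})

# a left join keeps every application even when no bureau records exist
merged = apps.merge(bureau, on='SK_ID_CURR', how='left')
```

Note that bureau records are one-to-many per loan, which is why the actual workflow aggregates them into engineered features before merging.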

Loan Application Features

There are 122 features and 307511 observations in application_train.csv.

In [3]:
df.shape
Out[3]:
(307511, 122)

The following features are loan application attributes from application_train.csv. Descriptions, types, and units are given for each feature. The original dataset also contained a large number of features describing the client's building; these features could not be used due to the large number of missing values.

Feature Description Type Units
SK_ID_CURR ID of loan in our sample Category N/A
TARGET Target Variable (1 - difficulty paying loan, 0 - all other cases) Category N/A
NAME_CONTRACT_TYPE Identification if loan is cash or revolving Category N/A
CODE_GENDER Gender of the client (M - male, F - female) Category N/A
FLAG_OWN_CAR Flag if the client owns a car Category N/A
FLAG_OWN_REALTY Flag if client owns a house or flat Category N/A
CNT_CHILDREN Number of children the client has Continuous N/A
AMT_INCOME_TOTAL Income of the client Continuous Currency
AMT_CREDIT Credit amount of the loan Continuous Currency
AMT_ANNUITY Loan annuity Continuous Currency
AMT_GOODS_PRICE For consumer loans it is the price of the goods for which the loan is given Continuous Currency
NAME_TYPE_SUITE Who was accompanying client when he was applying for the loan Category N/A
NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave) Category N/A
NAME_EDUCATION_TYPE Level of highest education the client achieved Category N/A
NAME_FAMILY_STATUS Family status of the client Category N/A
NAME_HOUSING_TYPE What is the housing situation of the client (renting, living with parents, ...) Category N/A
REGION_POPULATION_RELATIVE Normalized population of region where client lives (higher number means the client lives in more populated region) Continuous N/A
DAYS_BIRTH Client's age in days at the time of application Continuous Days
DAYS_EMPLOYED How many days before the application the person started current employment Continuous Days
DAYS_REGISTRATION How many days before the application did client change his registration Continuous Days
DAYS_ID_PUBLISH How many days before the application did client change the identity document with which he applied for the loan Continuous Days
OWN_CAR_AGE Age of client's car Continuous Years
FLAG_MOBIL Did client provide mobile phone (Y, N) Category N/A
FLAG_EMP_PHONE Did client provide work phone (Y, N) Category N/A
FLAG_WORK_PHONE Did client provide home phone (Y, N) Category N/A
FLAG_CONT_MOBILE Was mobile phone reachable (Y, N) Category N/A
FLAG_PHONE Did client provide home phone (Y, N) Category N/A
FLAG_EMAIL Did client provide email (Y, N) Category N/A
CNT_FAM_MEMBERS How many family members does client have Continuous N/A
OCCUPATION_TYPE What kind of occupation does the client have Category N/A
REGION_RATING_CLIENT Our rating of the region where client lives (1,2,3) Category N/A
REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives with taking city into account (1,2,3) Category N/A
WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply for the loan Category N/A
HOUR_APPR_PROCESS_START Approximately at what hour did the client apply for the loan Category N/A
REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not match contact address (1=different, 0=same, at region level) Category N/A
REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not match work address (1=different, 0=same, at region level) Category N/A
LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not match work address (1=different, 0=same, at region level) Category N/A
REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) Category N/A
REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not match work address (1=different, 0=same, at city level) Category N/A
LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not match work address (1=different, 0=same, at city level) Category N/A
ORGANIZATION_TYPE Type of organization where client works Category N/A
EXT_SOURCE_1 Normalized score from external data source Continuous N/A
EXT_SOURCE_2 Normalized score from external data source Continuous N/A
EXT_SOURCE_3 Normalized score from external data source Continuous N/A
OBS_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 30 DPD (days past due) default Continuous N/A
DEF_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 30 DPD (days past due) Continuous N/A
OBS_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 60 DPD (days past due) default Continuous N/A
DEF_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 60 DPD (days past due) Continuous N/A
DAYS_LAST_PHONE_CHANGE How many days before application did client change phone Continuous Days
FLAG_DOCUMENT_2 Did client provide document 2 Category N/A
FLAG_DOCUMENT_3 Did client provide document 3 Category N/A
FLAG_DOCUMENT_4 Did client provide document 4 Category N/A
FLAG_DOCUMENT_5 Did client provide document 5 Category N/A
FLAG_DOCUMENT_6 Did client provide document 6 Category N/A
FLAG_DOCUMENT_7 Did client provide document 7 Category N/A
FLAG_DOCUMENT_8 Did client provide document 8 Category N/A
FLAG_DOCUMENT_9 Did client provide document 9 Category N/A
FLAG_DOCUMENT_10 Did client provide document 10 Category N/A
FLAG_DOCUMENT_11 Did client provide document 11 Category N/A
FLAG_DOCUMENT_12 Did client provide document 12 Category N/A
FLAG_DOCUMENT_13 Did client provide document 13 Category N/A
FLAG_DOCUMENT_14 Did client provide document 14 Category N/A
FLAG_DOCUMENT_15 Did client provide document 15 Category N/A
FLAG_DOCUMENT_16 Did client provide document 16 Category N/A
FLAG_DOCUMENT_17 Did client provide document 17 Category N/A
FLAG_DOCUMENT_18 Did client provide document 18 Category N/A
FLAG_DOCUMENT_19 Did client provide document 19 Category N/A
FLAG_DOCUMENT_20 Did client provide document 20 Category N/A
FLAG_DOCUMENT_21 Did client provide document 21 Category N/A
AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the client one hour before application Category N/A
AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) Category N/A
AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) Category N/A
AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) Category N/A
AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application) Category N/A
AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application) Category N/A

Bureau Data Features

Features from the bureau data are not used directly; new features are engineered from them. The base features used in this feature engineering are described below. For details on the engineered features, see section 2.5 Feature Engineering.

Feature Description Type Units
SK_ID_CURR ID of loan in our sample Category N/A
CREDIT_ACTIVE The status of reported credits (e.g., Active or Closed) Category N/A
CREDIT_DAY_OVERDUE The number of days a given credit is overdue Continuous Days
AMT_CREDIT_SUM Current credit amount for the Credit Bureau credit Continuous Currency
AMT_CREDIT_SUM_DEBT Current debt on Credit Bureau credit Continuous Currency
AMT_CREDIT_SUM_LIMIT Current credit limit of credit card reported in Credit Bureau Continuous Currency
AMT_CREDIT_SUM_OVERDUE Current amount overdue on Credit Bureau credit Continuous Currency

2.3 Verify Data Quality

Note: All of the resolutions to the data quality issues described in this section are implemented in cleaning.py, which is used to load the cleaned dataset.

2.3.1. Missing Data

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and must be either filled in or removed before we train a machine learning model. First, let's get a sense of how many missing values are in each column.

While we always want to be careful about removing information, a column with a high percentage of missing values probably will not be useful to our model. The threshold for removing columns depends on the problem.
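The missing_values_table helper used below comes from the team's custom cleaning package; as a rough sketch of the same idea (our re-implementation, not the actual helper):

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percent of missing values per column, sorted descending."""
    counts = df.isnull().sum()
    percents = (100 * counts / len(df)).round(1)
    table = pd.DataFrame({'Missing Values': counts,
                          '% of Total Values': percents})
    # keep only columns that actually have missing values
    return table[table['Missing Values'] > 0].sort_values(
        'Missing Values', ascending=False)

# toy example: column 'a' is half missing, column 'b' is complete
toy = pd.DataFrame({'a': [1, np.nan, 3, np.nan], 'b': [1, 2, 3, 4]})
summary = missing_summary(toy)
```

Columns whose percentage exceeds a chosen threshold would then be candidates for removal.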

In [4]:
# create missing values table with all rows
pd.set_option('display.max_rows', 129)
missing_values_table(df)
Your selected dataframe has 122 columns.
There are 69 columns that have missing values.
Out[4]:
Missing Values % of Total Values
COMMONAREA_AVG 214865 69.9
COMMONAREA_MEDI 214865 69.9
COMMONAREA_MODE 214865 69.9
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_AVG 213514 69.4
NONLIVINGAPARTMENTS_MODE 213514 69.4
FONDKAPREMONT_MODE 210295 68.4
LIVINGAPARTMENTS_AVG 210199 68.4
LIVINGAPARTMENTS_MEDI 210199 68.4
LIVINGAPARTMENTS_MODE 210199 68.4
FLOORSMIN_MEDI 208642 67.8
FLOORSMIN_AVG 208642 67.8
FLOORSMIN_MODE 208642 67.8
YEARS_BUILD_MODE 204488 66.5
YEARS_BUILD_MEDI 204488 66.5
YEARS_BUILD_AVG 204488 66.5
OWN_CAR_AGE 202929 66.0
LANDAREA_MEDI 182590 59.4
LANDAREA_AVG 182590 59.4
LANDAREA_MODE 182590 59.4
BASEMENTAREA_MODE 179943 58.5
BASEMENTAREA_MEDI 179943 58.5
BASEMENTAREA_AVG 179943 58.5
EXT_SOURCE_1 173378 56.4
NONLIVINGAREA_MODE 169682 55.2
NONLIVINGAREA_MEDI 169682 55.2
NONLIVINGAREA_AVG 169682 55.2
ELEVATORS_MODE 163891 53.3
ELEVATORS_AVG 163891 53.3
ELEVATORS_MEDI 163891 53.3
WALLSMATERIAL_MODE 156341 50.8
APARTMENTS_MODE 156061 50.7
APARTMENTS_MEDI 156061 50.7
APARTMENTS_AVG 156061 50.7
ENTRANCES_AVG 154828 50.3
ENTRANCES_MEDI 154828 50.3
ENTRANCES_MODE 154828 50.3
LIVINGAREA_MEDI 154350 50.2
LIVINGAREA_MODE 154350 50.2
LIVINGAREA_AVG 154350 50.2
HOUSETYPE_MODE 154297 50.2
FLOORSMAX_AVG 153020 49.8
FLOORSMAX_MODE 153020 49.8
FLOORSMAX_MEDI 153020 49.8
YEARS_BEGINEXPLUATATION_MEDI 150007 48.8
YEARS_BEGINEXPLUATATION_MODE 150007 48.8
YEARS_BEGINEXPLUATATION_AVG 150007 48.8
TOTALAREA_MODE 148431 48.3
EMERGENCYSTATE_MODE 145755 47.4
OCCUPATION_TYPE 96391 31.3
EXT_SOURCE_3 60965 19.8
ORGANIZATION_TYPE 55374 18.0
AMT_REQ_CREDIT_BUREAU_MON 41519 13.5
AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.5
AMT_REQ_CREDIT_BUREAU_QRT 41519 13.5
AMT_REQ_CREDIT_BUREAU_DAY 41519 13.5
AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.5
AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.5
NAME_TYPE_SUITE 1292 0.4
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.3
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.3
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.3
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.3
EXT_SOURCE_2 660 0.2
AMT_GOODS_PRICE 278 0.1
AMT_ANNUITY 12 0.0
CODE_GENDER 4 0.0
CNT_FAM_MEMBERS 2 0.0
DAYS_LAST_PHONE_CHANGE 1 0.0
In [5]:
# return row display setting to default
pd.set_option('display.max_rows', pd_max_rows_default)

Analysis of Missing Data in Each Feature

Each feature or set of features will be discussed in descending order of the rate of missing values.

Building Features with Missing Values
A large number of the features with missing values are normalized information about the building where the client lives (such as ENTRANCES_MODE or COMMONAREA_AVG). It is plausible that only some of these features exist for a given client's building. It is also plausible that some of the records are simply missing (not provided by the client). There does not appear to be an indication of whether values are missing or not applicable to the client. All of these features have a missing rate above 40%.

Missing values in OWN_CAR_AGE
This feature refers to the age of the car owned by the client. Approximately 66.0% of the values are empty. However, there is also a feature FLAG_OWN_CAR, which indicates whether the client owns a car. It is reasonable to expect that values will be missing if the client does not own a car. For clients reporting owning a car, all but 5 car age values are present. For clients reporting not owning a car, no car ages are reported. Since the missing rate among car owners is actually very small, these missing values could be imputed.

In [6]:
# get indexes of OWN_CAR_AGE that are not NA
car_age_indexes = df[df.OWN_CAR_AGE.notnull()].index
# get indexes of FLAG_OWN_CAR with value equal to Y (client owns car)
owns_car_indexes = df.query('FLAG_OWN_CAR == "Y"').index
print('There are {} records with ages for cars'.format(car_age_indexes.shape[0]))
print('There are {} records indicating that the client owns a car'.format(owns_car_indexes.shape[0]))
# get the number of non-null car age values on records where client does not list a car
car_age_not_own_car = df.query('FLAG_OWN_CAR == "N"').OWN_CAR_AGE.notnull().sum()
print('There are {} car ages reported for clients that report NOT owning a car'.format(car_age_not_own_car))
There are 104582 records with ages for cars
There are 104587 records indicating that the client owns a car
There are 0 car ages reported for clients that report NOT owning a car

Missing values in EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3
Since there are zeros in these columns, we expect that the missing values represent the lack of an external source for these clients. Therefore, we will add an encoding feature EXT_SOURCE_<number>_AV for each of the external source features to represent the presence of external information. The original columns will only be used as an interaction with these encoding features, and their missing values will be filled with zero.

  • EXT_SOURCE_1 has a missing rate of 56.4%
  • EXT_SOURCE_2 has a missing rate of 0.2%
  • EXT_SOURCE_3 has a missing rate of 19.8%
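A minimal sketch of the encoding described above, applied to a toy frame (the _AV suffix follows the convention stated; the data itself is illustrative):

```python
import numpy as np
import pandas as pd

# toy stand-in for the external source columns
ext = pd.DataFrame({'EXT_SOURCE_1': [0.5, np.nan, 0.0],
                    'EXT_SOURCE_2': [np.nan, 0.3, 0.9]})

for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2']:
    # availability flag: 1 if the external score was present, else 0
    ext[col + '_AV'] = ext[col].notnull().astype(int)
    # fill the original score with zero where the source was absent
    ext[col] = ext[col].fillna(0)
```

The flag preserves the distinction between a true zero score and an absent source after the fill.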

Missing values in OCCUPATION_TYPE
There does not appear to be an indication that OCCUPATION_TYPE is systematically missing. We will assume the client did not provide this information and impute with a new categorical level, Unknown. OCCUPATION_TYPE has a missing rate of 31.3%.

Missing values in ORGANIZATION_TYPE
Later in the outliers section, it is shown that the NAs in ORGANIZATION_TYPE are associated with occupations listed as 'Unemployed' or 'Pensioner'. We will take this to mean that these clients are not associated with any employer and will impute these NAs with 'None'. ORGANIZATION_TYPE has a missing rate of 18%.

Missing values for AMT_REQ_CREDIT_BUREAU Features
These features indicate the number of credit enquiries on the client in a given time period: hour, day, week, month, quarter, and year. There does not appear to be a systematic reason for the missing values in these features. We will treat these records as having no credit enquiries and impute with zero. These features have a missing rate of 13.5%.

Remaining Features with Low Missing Rate
The remaining features are missing at rates below 0.5%. Imputation on these features should have only a small impact on the total dataset. We will use standard imputation strategies for these features: the mode for categorical features and the median for continuous features. We impute continuous features with the median rather than the mean because the continuous features are skewed and the median is not influenced by extreme values.
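The mode/median strategy can be sketched on a toy frame (illustrative values, not project data):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'NAME_TYPE_SUITE': ['A', 'A', None, 'B'],
                       'AMT_ANNUITY': [10.0, 20.0, np.nan, 1000.0]})

# categorical: impute with the mode
sample['NAME_TYPE_SUITE'] = sample['NAME_TYPE_SUITE'].fillna(
    sample['NAME_TYPE_SUITE'].mode()[0])
# continuous: impute with the median (robust to the skew noted above)
sample['AMT_ANNUITY'] = sample['AMT_ANNUITY'].fillna(
    sample['AMT_ANNUITY'].median())
```

Here the extreme value 1000.0 pulls the mean far above the typical value, but leaves the median at 20.0, which is why the median is the safer fill.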

In [7]:
pd.set_option('display.max_rows', 122)
data = read_clean_data()
missing_values_table(data)
Your selected dataframe has 82 columns.
There are 0 columns that have missing values.
Out[7]:
Missing Values % of Total Values
In [8]:
# load the bureau dataset if not loaded
if 'bureau' not in globals():
    bureau = pd.read_csv('./bureau.csv', na_values = ['XNA'])

#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
newFeatures = newFeatures.fillna(0)
# merge datasets on loan ID
data = data.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')
# fill all NAs with 0 in the new columns
fn_columns = list(newFeatures.columns)
for name in fn_columns:
    data[name] = data[name].fillna(0)
In [9]:
missing_values_table(data)
Your selected dataframe has 89 columns.
There are 0 columns that have missing values.
Out[9]:
Missing Values % of Total Values

2.3.2. Outliers

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

  • Below the first quartile − 3 ∗ interquartile range
  • Above the third quartile + 3 ∗ interquartile range
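Those bounds can be computed directly with NumPy; a small sketch (our helper, not part of the project's packages):

```python
import numpy as np

def extreme_outlier_bounds(x):
    """Bounds per the NIST definition of extreme outliers (3 * IQR fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - 3 * iqr, q3 + 3 * iqr

# example on the values 1..100
lo, hi = extreme_outlier_bounds(np.arange(1, 101))
```

Values below `lo` or above `hi` would be flagged as extreme outliers.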

High Values in DAYS_EMPLOYED

There are a large number of entries for DAYS_EMPLOYED outside the main distribution. These entries are at value 365243 and there are 55374 instances.

In [10]:
df.DAYS_EMPLOYED.hist(bins = 50);
In [11]:
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.shape
Out[11]:
(55374,)
In [12]:
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.head()
Out[12]:
8     365243
11    365243
23    365243
38    365243
43    365243
Name: DAYS_EMPLOYED, dtype: int64

These high values for DAYS_EMPLOYED appear to be associated with clients that are Pensioners or Unemployed and do not list an employment organization (ORGANIZATION_TYPE).

In [13]:
# get the indexes of instances with missing ORGANIZATION_TYPE
# (these are the Pensioner/Unemployed clients)
filtered_index = df[df.ORGANIZATION_TYPE.isna()].index
# get the indexes of the high values
high_val_index = df.query("DAYS_EMPLOYED >= 100000").index
# assert that indexes are the same
if np.equal(filtered_index, high_val_index).all():
    print('Index of queried values are the same.')
else:
    print('Indexes of queried values are different.')
Index of queried values are the same.

Possible Solution

A possible solution to this data quality issue would be to add a None level to ORGANIZATION_TYPE and only use DAYS_EMPLOYED as an interaction with ORGANIZATION_TYPE.
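A sketch of that approach on a toy frame (column names from the dataset; the interaction term and its name are our assumption):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'ORGANIZATION_TYPE': ['Business Entity', np.nan],
                    'DAYS_EMPLOYED': [-1200, 365243]})

# add an explicit None level for clients with no employer
toy['ORGANIZATION_TYPE'] = toy['ORGANIZATION_TYPE'].fillna('None')
# use DAYS_EMPLOYED only where an organization exists,
# so the 365243 sentinel never reaches the model
has_org = (toy['ORGANIZATION_TYPE'] != 'None').astype(int)
toy['DAYS_EMPLOYED_X_ORG'] = toy['DAYS_EMPLOYED'] * has_org
```

The interaction zeroes DAYS_EMPLOYED wherever ORGANIZATION_TYPE is None, which is exactly where the sentinel value appears.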

Data Cleaning Script

All of the cleaning steps discussed in the sections above are implemented in cleaning.py. This script contains a function (read_clean_data) that applies the cleaning steps and returns the cleaned dataset.

Details

  • Cleaning
    • Read csv with Pandas (setting correct data types)
    • Drop columns that will not be used
    • Recode NA values that are not listed as np.nan
    • Apply formatting fixes
    • Encode categorical variables
  • Returns
    • DataFrame with cleaned data

Sample Output

In [14]:
data = read_clean_data()
data.head(2)
Out[14]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR EMPLOYED EXT_SOURCE_1_AV EXT_SOURCE_2_AV EXT_SOURCE_3_AV CREDIT_INCOME_RATIO ANNUITY_INCOME_RATIO PERCENT_EMPLOYED_TO_AGE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 1 1 1 1 1 2.007889 0.121978 0.067329
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 1 1 1 0 4.790750 0.132217 0.070862

2 rows × 82 columns

2.4 Cleaned Dataset Simple Statistics Profile (Q2C)

In [15]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
In [16]:
# Generate the pandas_profile report
#profile = ProfileReport(data, )
profile = ProfileReport(data, minimal=True)
In [17]:
#Create HTML Frame in notebook
profile.to_notebook_iframe()

2.4.1 Distributions (Q2D)

This part of the exploration focuses on box plots and histograms for visualizing continuous variables and bar charts for visualizing categorical variables. Variables we expect to be important to the analysis were selected for univariate visualization.

In [18]:
fig, axes = plt.subplots(2,2, figsize=(10, 10))
# histogram of total income
axes[0,0].hist(np.log10(data.AMT_INCOME_TOTAL), bins = 30)
axes[0,0].set_xlabel('log10 of Total Income')
axes[0,0].set_title('Histogram of Total Income (Log-Transformed)');
# boxplot of total income
axes[1,0].boxplot(np.log10(data.AMT_INCOME_TOTAL), vert = False)
axes[1,0].set_xlabel('log10 of Total Income')
axes[1,0].set_title('Boxplot of Total Income (Log-Transformed)');

# histogram of annuity
axes[0,1].hist((data.AMT_ANNUITY), bins = 30)
axes[0,1].set_xlabel('Amount Annuity')
axes[0,1].set_title('Histogram of Amount Annuity');
# boxplot of annuity
axes[1,1].boxplot((data.AMT_ANNUITY), vert = False)
axes[1,1].set_xlabel('Amount Annuity')
axes[1,1].set_title('Boxplot of Amount Annuity');

Financial Features

We expect that financial features will be important to the analysis. Features such as AMT_INCOME_TOTAL should be relevant to a client's ability to pay back a loan. Additionally, a client's income, especially very large or very small values, may influence a bank's willingness to extend credit. However, clients with extremely large incomes may receive special treatment due to their potential value.

Due to the skewed nature of these distributions, both a histogram and a boxplot are shown. The histogram shows the shape of the main distribution, while the behavior of the tails and extreme values is shown in the boxplot.

AMT_INCOME_TOTAL

For ease of visualization, AMT_INCOME_TOTAL was transformed with log base 10 because some very large observations distort the histogram binning. As shown in the pair of plots for AMT_INCOME_TOTAL, most of the values are clustered just above 100,000 (10^5 on the log scale). The main section of the log-transformed distribution appears to be relatively close to normal (the distribution is right-skewed on the original scale). Several extreme values are shown in the box plot. The most extreme value is approximately 1,000 times larger than the median and mean of the distribution.

AMT_ANNUITY

Like AMT_INCOME_TOTAL, AMT_ANNUITY has a right-skewed distribution. However, this feature could be plotted on the original scale because its range is not as large as that of AMT_INCOME_TOTAL. The histogram shows that most of the values are between 0 and 50,000. The boxplot shows some extreme values outside the main distribution. The largest extreme value is approximately 10 times the median.

For the financial features we expect that some kind of transformation may be necessary to reduce the influence of the extreme values. We may also consider removing very large outliers, as they may be treated differently by the bank. For instance, clients with very large incomes may work personally with a bank representative rather than interacting with an automated system.

In [19]:
fig, axes = plt.subplots(1,2, figsize=(10, 5))
# histogram of age
axes[0].hist(data.DAYS_BIRTH / 365, bins = 35)
axes[0].set_xlabel('Client Age (Years)')
axes[0].set_title('Histogram of Client Age');
# histogram of employment tenure
# filter out the clients that do not appear to be employed
employed = data.query('NAME_INCOME_TYPE != "Pensioner" & NAME_INCOME_TYPE != "Unemployed"')
axes[1].hist((employed.DAYS_EMPLOYED / 365), bins = 30)
axes[1].set_xlabel('Tenure with Current Employer (Years)')
axes[1].set_title('Histogram of Client Tenure with Employer');

DAYS_BIRTH

We expect that client age may be an important factor in determining a client's ability to pay back a loan. Higher ages may represent more time to accumulate wealth and more experience dealing with financial matters. The ages are transformed into years (the original scale is days) because years are a more interpretable unit when the values are large. Client ages appear to be almost uniformly distributed, with the majority of clients between roughly 25 and 65 years of age.

DAYS_EMPLOYED

We expect that tenure may be useful in predicting loan repayment ability. In some cases, individuals who have stayed with an employer for a long period of time are compensated well. Since the values are large, tenure with the most recent employer is plotted in years rather than days. Tenure with current employer is very right-skewed. There is a steep decrease in client tenure from 0 years to approximately 15 years; around 15 years there appears to be an inflection point where the rate of decrease slows dramatically. Clients listed as pensioners or unemployed were filtered out of this distribution.

In [20]:
# count_values_table raised a TypeError on this string-indexed series,
# so use value_counts directly for the same counts
axes = data.NAME_HOUSING_TYPE.value_counts().plot.bar()
axes.set_xlabel('Housing Type')
axes.set_ylabel('Number of Clients')
axes.set_title('Number of Clients By Type of Housing');

NAME_HOUSING_TYPE

Almost all clients reported living in a house or apartment; only a relatively small portion reported other types of living situations. The skewness of this categorical variable could be an issue because of the bias toward a single response.

In [ ]:
# use value_counts directly (dropna=False keeps missing values as a bar)
axes = data.OCCUPATION_TYPE.value_counts(dropna=False).plot.bar();
axes.set_xlabel('Name of Occupation Type')
axes.set_ylabel('Number of Clients')
axes.set_title('Number of Clients By Occupation Type');

OCCUPATION_TYPE

We expect that the client's type of occupation may affect the ability to repay loans. The bar chart shows the counts of each occupation type reported by clients. Many occupation types were reported; this feature has high cardinality, with low counts in about two-thirds of the categories. Oddly, a large number of clients did not report an occupation type; these are listed as "Unknown." It is possible that too many clients reported no occupation type for this feature to be useful.
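One common way to handle the high cardinality noted above is to collapse low-count categories into a single "Other" level. This is a sketch on toy labels, not the lab's actual preprocessing:

```python
import pandas as pd

# Toy occupation labels with a long tail of rare categories (hypothetical)
occ = pd.Series(['Laborers'] * 50 + ['Sales staff'] * 30 +
                ['IT staff'] * 3 + ['HR staff'] * 2)

# Collapse categories observed fewer than 5 times into 'Other'
counts = occ.value_counts()
rare = counts[counts < 5].index
occ_collapsed = occ.where(~occ.isin(rare), 'Other')
```

The count threshold is a tuning choice; too high a threshold discards real signal, too low leaves the cardinality problem in place.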

2.4.2 Correlations (Q2E)

To begin the process of exploring the relationships between continuous variables, we rely on two strategies:

  • Scatterplot Matrices
  • Correlation Plots

Scatterplot matrices are intended to help visually identify relationships, and evidence of multicollinearity, between predictor variables. Correlation plots, or heatmaps annotated with the Pearson correlation coefficient of two predictor variables, quantify the relationship between the predictors in question. Together, these visualizations provide an initial form of feature selection: relationships between predictor variables should be limited, while relationships to the target variable should be maximized.

A limitation of this strategy is that scatterplot matrices can only visualize a limited number of variables at once, so a deeper mathematical exploration in the form of component analysis will be performed as well.
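The component analysis mentioned above could proceed along the following lines (a sketch on synthetic data, assuming scikit-learn is available; standardizing first matters because the monetary features differ in scale by orders of magnitude):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for continuous features: two highly correlated
# columns plus one independent noise column
x = rng.normal(size=(200, 1))
X = np.hstack([x,
               0.9 * x + rng.normal(scale=0.1, size=(200, 1)),
               rng.normal(size=(200, 1))])

# Standardize, then inspect how much variance each component explains
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)
ratios = pca.explained_variance_ratio_
```

Because two of the three columns are nearly collinear, the first component captures most of the variance, which is exactly the redundancy a component analysis is meant to surface.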

To analyze the distribution of the target variable across the categorical variables, we rely primarily on:

  • Stacked Bar Charts

Stacked bar charts are an ideal way of analyzing the distribution of the target variable across the categorical variables.

Additionally, we will view the relationship between the continuous variables and the target variable using scatterplot matrices colored by group membership, where the groups are the levels of the target variable. This visualization method will help identify variables that provide the greatest discrimination between the two levels of TARGET.

Scatterplot Matrices of Continuous Variables

  • To begin, we will use scatterplot matrices to identify relationships between groups of variables, without the TARGET variable.
In [ ]:
import matplotlib.pyplot as plt
#from mlxtend.plotting import scatterplotmatrix

cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_GOODS_PRICE','AMT_ANNUITY', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
       'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE']
g = sns.pairplot(data[cols])
#g.map_lower(sns.kdeplot, cmap="Blues_d") # use joint kde on the lower triangle
#g.map_upper(plt.scatter) # scatter on the upper
#g.map_diag(sns.kdeplot, lw=3, legend = False) # kde histogram on the diagonal
g.fig.suptitle('Relationships: AMT_INCOME_TOTAL, AMT_CREDIT, AMT_GOODS_PRICE, AMT_ANNUITY', y= 1.08)
plt.tight_layout()
plt.show()

Outliers Identified

The scatterplot matrix revealed the impact of significant outliers in the AMT_INCOME_TOTAL feature. The most significant outlier is more than ten times that of the next greatest income.

Other Significant Findings

A collinear relationship between AMT_GOODS_PRICE and AMT_CREDIT also exists, representing the fact that cash credit is awarded on the basis of the value of the goods purchased. Also, the AMT_CREDIT, AMT_ANNUITY, and AMT_GOODS_PRICE variables appear to have almost identical relationships. The AMT_GOODS_PRICE feature will be considered for elimination, but kept in the training set until the component analysis is complete.

To observe the behavior of applicants based on income, we will bin them into quantiles using the AMT_INCOME_TOTAL feature and plot them against TARGET.

In [ ]:
#create quantiles
CAT_INCOME = pd.qcut(data['AMT_INCOME_TOTAL'], q = 4)
print(CAT_INCOME.head())

We chose to visualize the distribution of loan defaults across income quantiles with a heatmap because it allows us to quickly assess the distribution of the quantiles and identify the quantiles where the majority of defaults exist. This knowledge will help determine whether removing the high-income outliers from consideration will be detrimental to the model's ability to discriminate between the two levels of the TARGET variable.

In [ ]:
default_income = list(zip(data.TARGET.astype(bool), CAT_INCOME))
def_by_inc = pd.DataFrame(default_income, columns = ['TARGET', 'INCOME_CAT'])
def_by_inc = pd.crosstab(index = def_by_inc.TARGET, columns = def_by_inc.INCOME_CAT).apply(lambda x: (x/x.sum())*100, axis = 1)
def_by_inc.head()

p = sns.heatmap(def_by_inc, vmin = 15, vmax = 40, annot = True, linewidths = .5, cmap = 'coolwarm',
           annot_kws={'fontsize': 10,
                     'verticalalignment':'center'})
p.set(title = 'Defaults by Income Category',
       ylabel = 'Loan Status - True = Default',
       xlabel = 'Income Ranges')

A heatmap of loan defaults by income brackets indicates that a majority of loan defaults occur at the lower end of the income spectrum, within the range of 25,649 to 112,500. This observation suggests only a modest impact on the model if the highest income earners are removed from consideration.
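A sketch of the removal considered above; the values and the 99th-percentile cutoff are illustrative, but a frame like the `sansOutlier` used later could be built this way:

```python
import pandas as pd

# Toy applicant table (hypothetical values)
df = pd.DataFrame({
    'AMT_INCOME_TOTAL': [90_000, 150_000, 200_000, 117_000_000],
    'TARGET': [1, 0, 0, 1],
})

# Keep applicants strictly below the 99th percentile of income
cutoff = df['AMT_INCOME_TOTAL'].quantile(0.99)
sansOutlier = df[df['AMT_INCOME_TOTAL'] < cutoff]
```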

Correlation Plot

A heatmap is chosen to visualize the correlation matrix of the large scale continuous variables because of the ease of determining multicollinearity among the predictor variables.

In [ ]:
# creating correlation matrix
corr = data[cols].corr()
# generate mask for upper triangle (np.bool is deprecated; use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up matplotlib figure
f, ax = plt.subplots(figsize=(11,9))

# Draw heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap = 'coolwarm', center = 0.5, 
            linewidth = .5, cbar_kws={'shrink': .5},
           square = True, annot = True)

The correlation plot reveals a 99% correlation between the AMT_GOODS_PRICE and AMT_CREDIT features. AMT_GOODS_PRICE will be removed from consideration because it is collinear with AMT_ANNUITY as well as AMT_CREDIT and would interfere with determining feature importance.

In [ ]:
g = sns.pairplot(sansOutlier[cols])
g.fig.suptitle('Variables Without the Outlier', y= 1.08)
plt.tight_layout()
plt.show()
In [ ]:
#changing the TARGET variable to categorical for easier visualization
data['TARGET'] = data.TARGET.astype('category')
import seaborn as sns
sns.set(style="ticks")
engFeatures = ['CREDIT_ACTIVE','CREDIT_INCOME_RATIO','ANNUITY_INCOME_RATIO','PERCENT_EMPLOYED_TO_AGE', 'TARGET']

sns.pairplot(data[engFeatures], hue = 'TARGET')
plt.show()

The kernel density plots reveal that applicants with fewer active accounts, total accounts, and less available credit tend to have a higher risk of defaulting on loans, as expected. The outlier default whose income was in excess of 117 million dollars is also clearly evident in most of the plots and would likely have a strong impact on a final model. This provides more evidence for the removal of the millionaire default outlier.

In [ ]:
creditSums = ['AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM', 'TARGET']
sns.pairplot(data[creditSums], hue = 'TARGET')
plt.show()

When analyzing the distribution of the TARGET variable across the continuous credit features, the impact of the outlier is even more pronounced, making it difficult to identify the features critical to predicting defaults.

In [ ]:
defaultsHouse_loan = pd.crosstab([data['NAME_HOUSING_TYPE'], data['NAME_CONTRACT_TYPE']],
                            data.TARGET.astype(bool))
defaultsHouse_loan.plot(kind = 'barh')
plt.suptitle("Defaults by Housing and Loan Type ")
plt.xlabel('Default Count')
plt.ylabel('Housing and Loan Types')
plt.show()

This barplot reveals that a vast majority of the loans in the dataset are cash loans offered to people who own their house or apartment.

In [ ]:
defaultsActiveAcc = pd.crosstab([data.NAME_INCOME_TYPE, data.FLAG_OWN_CAR], data.TARGET.astype(bool))
defaultsActiveAcc.plot(kind = 'barh')

Differencing

Specifically, a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step: value(t) = observation(t) - observation(t-1).
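The differencing formula above maps directly onto `pandas.Series.diff`; a minimal sketch with hypothetical values:

```python
import pandas as pd

# Toy monthly balance series (hypothetical values)
balance = pd.Series([100, 120, 115, 130])

# value(t) = observation(t) - observation(t-1); the first element
# has no predecessor, so it becomes NaN
diffed = balance.diff()
```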

In [ ]:
#print(data.columns)
flags = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
#flagData = data[flags]
#flag_defaults = flagData.groupby('TARGET').count()
#sns.heatmap(flag_defaults)
#flag_defaults
data[flags].groupby(data.TARGET).describe()
In [ ]:
# Frequency (as a percentage) of the TARGET classes across each FLAG feature.
# pd.crosstab cannot take a whole DataFrame as `columns`, so build one table per flag.
flag_freq = {f: pd.crosstab(index = data.TARGET, columns = data[f]).apply(lambda x: (x/x.sum())*100, axis = 1)
             for f in flags}
flag_freq['FLAG_OWN_REALTY']

2.5 Feature Engineering (Q2G)

Engineered Loan Application Features

The following features were engineered from the loan application features (from application_train.csv).

| Engineered Feature | Description | Type | Units | Formula |
| --- | --- | --- | --- | --- |
| CREDIT_INCOME_RATIO | The ratio of credit to the client's income | Numeric | N/A | AMT_CREDIT / AMT_INCOME_TOTAL |
| ANNUITY_INCOME_RATIO | The ratio of the annuity to the client's income | Numeric | N/A | AMT_ANNUITY / AMT_INCOME_TOTAL |
| PERCENT_EMPLOYED_TO_AGE | The fraction of the client's days employed relative to age | Numeric | N/A | DAYS_EMPLOYED / DAYS_BIRTH |
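The application-level formulas in the table above are element-wise ratios; a sketch with hypothetical values (DAYS_EMPLOYED and DAYS_BIRTH are negative in the raw data, so their ratio is positive):

```python
import pandas as pd

# Toy application rows (hypothetical values)
app = pd.DataFrame({
    'AMT_CREDIT': [400_000, 600_000],
    'AMT_ANNUITY': [20_000, 30_000],
    'AMT_INCOME_TOTAL': [100_000, 200_000],
    'DAYS_EMPLOYED': [-2_000, -4_000],
    'DAYS_BIRTH': [-12_000, -16_000],
})

app['CREDIT_INCOME_RATIO'] = app['AMT_CREDIT'] / app['AMT_INCOME_TOTAL']
app['ANNUITY_INCOME_RATIO'] = app['AMT_ANNUITY'] / app['AMT_INCOME_TOTAL']
app['PERCENT_EMPLOYED_TO_AGE'] = app['DAYS_EMPLOYED'] / app['DAYS_BIRTH']
```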

The following features were engineered from the bureau features (from bureau.csv).

| Engineered Feature | Description | Type | Units | Formula |
| --- | --- | --- | --- | --- |
| LOAN_COUNT | The total number of accounts, active and closed | Numeric | N/A | count of CREDIT_ACTIVE, grouped by loan ID |
| CREDIT_ACTIVE | A count of active credit accounts by loan ID | Numeric | N/A | sum of (CREDIT_ACTIVE == 'Active'), grouped by loan ID |
| CREDIT_DAY_OVERDUE | A count of days overdue for active credit accounts by loan ID | Numeric | N/A | sum of CREDIT_DAY_OVERDUE for CREDIT_ACTIVE == 'Active', grouped by loan ID |
| AMT_CREDIT_SUM | Total credit available from active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM for CREDIT_ACTIVE == 'Active', grouped by loan ID |
| AMT_CREDIT_SUM_DEBT | Total debt of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_DEBT for CREDIT_ACTIVE == 'Active', grouped by loan ID |
| AMT_CREDIT_SUM_LIMIT | Overall credit limit of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_LIMIT for CREDIT_ACTIVE == 'Active', grouped by loan ID |
| AMT_CREDIT_SUM_OVERDUE | Total amount overdue | Numeric | N/A | sum of AMT_CREDIT_SUM_OVERDUE for CREDIT_ACTIVE == 'Active', grouped by loan ID |
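The bureau-level features above are groupby aggregations restricted to active accounts; a sketch on a toy `bureau` frame keyed by a client ID column `SK_ID_CURR` (the key name and values here are assumptions for illustration):

```python
import pandas as pd

# Toy bureau records: client 1 has one active and one closed loan;
# client 2 has one active loan
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'CREDIT_ACTIVE': ['Active', 'Closed', 'Active'],
    'AMT_CREDIT_SUM': [50_000, 20_000, 80_000],
})

# LOAN_COUNT counts all accounts; the AMT_* sums use active accounts only
loan_count = bureau.groupby('SK_ID_CURR')['CREDIT_ACTIVE'].size().rename('LOAN_COUNT')
active = bureau[bureau['CREDIT_ACTIVE'] == 'Active']
credit_sum = active.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].sum().rename('AMT_CREDIT_SUM')

agg = pd.concat([loan_count, credit_sum], axis=1)
```

The same pattern (filter to active, group, aggregate, join back) extends to the remaining AMT_* and overdue-day features in the table.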